Joining Distributed Database Summaries
Authors
Abstract
The database summarization system named SEQ provides multi-level summaries of tabular data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, in many companies, data are distributed among several sites, either homogeneously (i.e., sites contain data for a common set of features) or heterogeneously (i.e., sites contain data for different features). Consequently, the current centralized version of SEQ is either infeasible or undesirable because of privacy or resource issues. In this paper, we propose two new algorithms for summarizing heterogeneously distributed data without a prior "unification" of the data sources: the Subspace-Oriented Join Algorithm (SOJA) and the Tree Alignment-based Join Algorithm (TAJA). The main idea of these algorithms is to apply innovative joins to two local models, computed over two disjoint sets of features, to provide a global summary over the full feature set without scanning the raw data. SOJA takes one of the two input trees as the base model and processes the other one to complete it, whereas TAJA rearranges summaries level by level in a top-down manner. We then propose a consistent quality measure to quantify how good the joined hierarchies are. Finally, an experimental study on synthetic data sets shows that both joining processes (SOJA and TAJA) produce high-quality clustering schemas of the entire distributed data and are very efficient in computation time compared with the centralized approach.
Key-words: Database Summary, Distributed Clustering
∗ Atlas-Grim, INRIA/LINA-Université de Nantes † Université Internationale de Rabat
inria-00346528, version 1 - 11 Dec 2008
La Jointure des Résumés Distribués d’une Base de Données
Résumé: The SaintEtiQ system builds, from a relational table, a hierarchy of concepts that summarizes the relation. Summaries are generated by an incremental classification algorithm, and each one gives a concise representation of a subset of the tuples of the summarized relation through a set of linguistic descriptors on each attribute. The multiple levels of granularity offered by the hierarchical structure make it possible, a posteriori, to present a summarized form of the relation at a chosen level of precision. Today, in large organizations, data are geographically distributed over several sites, either homogeneously (i.e., horizontal fragmentation) or heterogeneously (i.e., vertical fragmentation). This distribution makes the conceptual classification procedure defined by SaintEtiQ inapplicable, because it requires the data to be available on the summary server; that assumption is either technically unsatisfiable (e.g., bandwidth, storage space, performance) or too intrusive (i.e., confidentiality). This work proposes two algorithms for summarizing two heterogeneous relations without accessing the original data: SOJA (Subspace-Oriented Join Algorithm) and TAJA (Tree Alignment-based Join Algorithm). Both algorithms take as input two summaries generated locally and autonomously at two distinct sites and combine them into a single summary of the relation obtained by joining the two local relations. Experimental results show that SOJA and TAJA are more efficient than the centralized approach (i.e., SaintEtiQ applied to the relations after they have been gathered and joined at a single site) and produce hierarchies similar to those of the centralized approach.
Mots-clés: Data summary, Distributed classification
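The idea of combining two local models built over disjoint feature sets can be illustrated with a small Python sketch. This is not SOJA or TAJA, whose details the abstract does not give; it is a plain cross-partition join of two flat clusterings. It assumes (hypothetically) that each site exposes only record IDs and local cluster labels, so no feature values are read. The names `join_partitions`, `site_a` and `site_b` are illustrative only.

```python
from collections import defaultdict


def join_partitions(left, right):
    """Combine two flat clusterings keyed by record ID.

    `left` and `right` map record ID -> local cluster label, each computed
    at a different site over a disjoint feature set (an assumption of this
    sketch). The result groups record IDs by the pair
    (left label, right label), i.e. the non-empty cells of the cross
    product of the two partitions.
    """
    cells = defaultdict(set)
    # Only records known to both sites can be placed in a joined cell.
    for rid in left.keys() & right.keys():
        cells[(left[rid], right[rid])].add(rid)
    return dict(cells)


# Hypothetical local models: site A clustered on age, site B on income.
site_a = {1: "young", 2: "young", 3: "old", 4: "old"}
site_b = {1: "low", 2: "high", 3: "high", 4: "high"}

joined = join_partitions(site_a, site_b)
for labels, records in sorted(joined.items()):
    print(labels, sorted(records))
```

Each joined cell describes its records over the full feature set using only the two local labels. A hierarchical variant would repeat this kind of combination across tree levels, which is where the two proposed algorithms differ.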
Similar resources
Scalable Queries over Log Database Collections
Zhu, M. 2016. Scalable Queries over Log Database Collections. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1343. 51 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9472-8. In industrial settings, machines such as trucks, hydraulic pumps, etc. are widely distributed at different geographic locations where sensors on machines pro...
XML Structural Summaries
This tutorial introduces the concept of XML Structural Summaries and describes their role within XML retrieval. It covers the usage of those summaries for Database-style query processing and Information Retrieval-style search tasks in the context of both centralized and distributed environments. Finally, it discusses new retrieval scenarios that can potentially be favorably supported by those s...
Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically rel...
Data Mining in Large Databases Using Domain Generalization
Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Mu...
PeerSum: Summary Management in P2P Systems
Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques become no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P